
Conversation

@bzhangGo
Contributor

What does this PR do?

Add support for T5Gemma2 with multi-modal and long-context capabilities.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@Rocketknight1
Member

Hey! Is this a pre-release model? I don't see the checkpoints like google/t5gemma-2-270m anywhere

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@bzhangGo
Contributor Author

Hey! Is this a pre-release model? I don't see the checkpoints like google/t5gemma-2-270m anywhere
yeah, coming soon

@Rocketknight1
Member

cc @ArthurZucker in that case!

Contributor

@vasqu left a comment


Just an initial review from my side.

The 3 main issues I see are:

  • We should definitely be able to have an SWA variant of the bidirectional mask directly in the utils (see the sketch after this list).
  • The processing fn for pixel values should not be passed around imo; it should be handled a level above.
  • We should have one level where the encoder and decoder models are wrapped under a language model.
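
To make the first point concrete, here is a minimal sketch of what a bidirectional sliding-window mask could look like. The function name and shape conventions are illustrative assumptions, not the existing transformers masking utils.

```python
import torch


def bidirectional_sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    # Hypothetical helper, not a transformers API: position j is visible from
    # position i whenever |i - j| < window, with no causal constraint.
    positions = torch.arange(seq_len)
    distance = (positions[None, :] - positions[:, None]).abs()
    return distance < window  # True where attention is allowed


# With window=2, each token attends to itself and one neighbour on each side.
mask = bidirectional_sliding_window_mask(seq_len=6, window=2)
```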

@bzhangGo
Copy link
Contributor Author

@vasqu Thanks for the comments. Re the major issues:

We should definitely be able to have an SWA variant of the bidirectional mask directly in the utils.

I agree with this move, but I think it's better done on the transformers side since it's a big behavior change. wdyt?

The processing fn for pixel values should not be passed around imo; it should be handled a level above.

This relates to how transformers handles generation for encoder-decoder models. There are two unusual behaviors (sketched below):

  1. the cache is created by GenerationMixin, not the model: https://github.com/huggingface/transformers/blob/main/src/transformers/generation/utils.py#L1924
  2. the encoder outputs are created from the encoder directly, not the model: https://github.com/huggingface/transformers/blob/main/src/transformers/generation/utils.py#L902
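
Roughly, as a simplified paraphrase of those two behaviors (not the actual transformers source; DynamicCache and get_encoder are the only real APIs used here):

```python
from transformers import DynamicCache


def prepare_for_generation_sketch(model, model_kwargs):
    # 1. the cache is instantiated by the generation code, not by the model
    model_kwargs["past_key_values"] = DynamicCache()

    # 2. encoder_outputs come from calling the bare encoder module, bypassing
    #    any model-level handling of pixel_values / vision preprocessing
    encoder = model.get_encoder()
    encoder_kwargs = {k: v for k, v in model_kwargs.items() if k != "past_key_values"}
    model_kwargs["encoder_outputs"] = encoder(**encoder_kwargs)
    return model_kwargs
```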

This is the main reason we have some weird designs in T5Gemma2, including the dynamic adjustment of the sliding-window size and the special handling of the vision preprocessor. Please let me know if you have any suggestions!

We should have one level where the encoder and decoder models are wrapped under a language model.

I'm not sure it's a good idea to have another wrapper level for encoder-decoders, as it's common to put the encoder and decoder into the model jointly, as in T5/Bart/T5Gemma.
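
For reference, a toy sketch of the flat layout I mean (illustrative class names, not the real T5Gemma2 code): the encoder and decoder sit directly on the model, with no extra language-model wrapper in between.

```python
from torch import nn


class ToyEncoder(nn.Module):
    def forward(self, input_ids, **kwargs):
        return input_ids  # placeholder for a real encoder stack


class ToyDecoder(nn.Module):
    def forward(self, decoder_input_ids, encoder_hidden_states=None, **kwargs):
        return decoder_input_ids  # placeholder for a real decoder stack


class ToyEncoderDecoderModel(nn.Module):
    # T5/Bart-style flat layout: model.encoder and model.decoder directly
    def __init__(self):
        super().__init__()
        self.encoder = ToyEncoder()
        self.decoder = ToyDecoder()

    def get_encoder(self):
        # generate() relies on get_encoder() to compute encoder_outputs itself
        return self.encoder
```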

Contributor

@vasqu left a comment


The key change should be how we handle caches and the encoder with vision preprocessing (get image features, etc.).

Regarding the generation-related issue, for now we should override the respective functions in the generation mixin we inherit from. I agree that ideally we'd have proper logic in our own code, but I want to postpone this and fix it properly in the future; small overrides are fine, and we already encounter these here and there anyway.
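
One possible shape for such an override, as a hedged sketch: the hook name and signature below follow what GenerationMixin exposes in recent transformers versions but may differ across releases, and get_image_features is an assumed helper rather than a guaranteed API.

```python
from transformers import GenerationMixin


class T5Gemma2GenerationOverrideSketch(GenerationMixin):
    # Illustrative stand-in for the real model class (which also inherits from
    # PreTrainedModel); only the small override is shown here.
    def _prepare_encoder_decoder_kwargs_for_generation(
        self, inputs_tensor, model_kwargs, model_input_name, generation_config
    ):
        # Handle pixel values at the model level before the bare encoder is
        # called, then delegate to the default GenerationMixin behavior.
        pixel_values = model_kwargs.pop("pixel_values", None)
        if pixel_values is not None:
            # get_image_features is assumed to exist on the model (hypothetical)
            model_kwargs["image_features"] = self.get_image_features(pixel_values)
        return super()._prepare_encoder_decoder_kwargs_for_generation(
            inputs_tensor, model_kwargs, model_input_name, generation_config
        )
```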

Contributor

@vasqu left a comment


Thank you, this is already looking pretty good. There are no big issues tbh, just a few smaller ones to align with our standards.

Contributor

@vasqu left a comment


Thanks a lot for iterating! I left some last comments, which are mostly about shortening/simplifying things.

If we could remove some test overrides, that would be awesome. Atm there are a lot of them, which isn't ideal but also not the end of the world.

@vasqu
Contributor

vasqu commented Nov 17, 2025

cc @ArthurZucker @Cyrilvallez for core maintainer review

This is an encoder-decoder model with multimodal capabilities. To properly interact with our generation pipeline, the vision backbone is within the encoder.
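
A minimal sketch of that layout, with hypothetical module names rather than the actual implementation: because the vision backbone lives inside the encoder, the bare encoder that generate() obtains via get_encoder() can consume pixel_values itself.

```python
from torch import nn


class MultimodalEncoderSketch(nn.Module):
    # Hypothetical layout: the vision backbone is a submodule of the encoder.
    def __init__(self):
        super().__init__()
        self.vision_tower = nn.Identity()  # placeholder vision backbone
        self.text_encoder = nn.Identity()  # placeholder text encoder stack

    def forward(self, inputs_embeds, pixel_values=None, **kwargs):
        if pixel_values is not None:
            image_features = self.vision_tower(pixel_values)
            # the real model would merge image_features into inputs_embeds at
            # the image placeholder positions; omitted in this sketch
        return self.text_encoder(inputs_embeds)
```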

@github-actions
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto, t5gemma, t5gemma2
